Learning Descriptors for Object Recognition and 3D Pose Estimation
Detecting poorly textured objects and estimating their 3D pose reliably is
still a very challenging problem. We introduce a simple but powerful approach
to computing descriptors for object views that efficiently capture both the
object identity and 3D pose. By contrast with previous manifold-based
approaches, we can rely on the Euclidean distance to evaluate the similarity
between descriptors, and therefore use scalable Nearest Neighbor search methods
to efficiently handle a large number of objects under a large range of poses.
To achieve this, we train a Convolutional Neural Network to compute these
descriptors by enforcing simple similarity and dissimilarity constraints
between the descriptors. We show that our constraints nicely untangle the
images from different objects and different views into clusters that are not
only well-separated but also structured according to the corresponding sets of poses: the
Euclidean distance between descriptors is large when the descriptors are from
different objects, and directly related to the distance between the poses when
the descriptors are from the same object. These important properties allow us
to outperform state-of-the-art object view representations on challenging RGB
and RGB-D data.
Comment: CVPR 2015
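The constraint-based training described above lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version of the idea (not the authors' exact architecture or loss): a small CNN maps each object view to a descriptor, and a triplet-style term enforces the similarity/dissimilarity constraints so that plain Euclidean distance separates objects and tracks pose.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DescriptorNet(nn.Module):
        """Maps a 3x64x64 object view to a low-dimensional descriptor."""
        def __init__(self, descriptor_dim=16):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.head = nn.Linear(32 * 13 * 13, descriptor_dim)  # assumes 64x64 input

        def forward(self, x):
            return self.head(self.features(x).flatten(1))

    def constraint_loss(anchor, same_object_near_pose, different_object, margin=1.0):
        # Dissimilarity constraint: descriptors of different objects should be far
        # apart; similarity constraint: descriptors of the same object under nearby
        # poses should be close, so Euclidean distance reflects pose distance.
        d_pos = (anchor - same_object_near_pose).pow(2).sum(dim=1)
        d_neg = (anchor - different_object).pow(2).sum(dim=1)
        return F.relu(d_pos - d_neg + margin).mean()

Once trained this way, descriptors for reference views can be indexed with any standard Euclidean nearest-neighbor structure, which is what makes the approach scale to many objects and poses.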
Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
Instrumenting and collecting annotated visual grasping datasets to train
modern machine learning algorithms can be extremely time-consuming and
expensive. An appealing alternative is to use off-the-shelf simulators to
render synthetic data for which ground-truth annotations are generated
automatically. Unfortunately, models trained purely on simulated data often
fail to generalize to the real world. We study how randomized simulated
environments and domain adaptation methods can be extended to train a grasping
system to grasp novel objects from raw monocular RGB images. We extensively
evaluate our approaches with a total of more than 25,000 physical test grasps,
studying a range of simulation conditions and domain adaptation methods,
including a novel extension of pixel-level domain adaptation that we term the
GraspGAN. We show that, by using synthetic data and domain adaptation, we are
able to reduce the number of real-world samples needed to achieve a given level
of performance by up to 50 times, using only randomly generated simulated
objects. We also show that, using only unlabeled real-world data and our
GraspGAN methodology, we obtain real-world grasping performance, without any
real-world labels, that is similar to that achieved with 939,777 labeled
real-world samples.
Comment: 9 pages, 5 figures, 3 tables
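As a rough illustration of pixel-level domain adaptation in the spirit described above (this is a simplified, hypothetical sketch, not the GraspGAN implementation; content-preservation and task-specific losses are omitted): a generator refines rendered simulator images toward the real-image distribution, a discriminator provides the adversarial signal, and the grasp predictor is then trained on the adapted images together with their simulation labels.

    import torch
    import torch.nn as nn

    # Generator: refines a rendered sim image so it resembles a real camera image.
    generator = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
    )

    # Discriminator: distinguishes real images from adapted simulated ones.
    discriminator = nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 1, 4, stride=2),
    )

    bce = nn.BCEWithLogitsLoss()

    def adaptation_step(sim_batch, real_batch, g_opt, d_opt):
        # 1) Train the discriminator: real images -> 1, adapted sim images -> 0.
        adapted = generator(sim_batch)
        d_real = discriminator(real_batch)
        d_fake = discriminator(adapted.detach())
        d_loss = (bce(d_real, torch.ones_like(d_real))
                  + bce(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # 2) Train the generator to fool the discriminator; the adapted images
        #    then feed the grasp model as if they were labeled real data.
        g_fake = discriminator(generator(sim_batch))
        g_loss = bce(g_fake, torch.ones_like(g_fake))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
        return adapted.detach()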
Open-World Object Manipulation using Pre-trained Vision-Language Models
For robots to follow instructions from people, they must be able to connect
the rich semantic information in human vocabulary, e.g., "can you get me the
pink stuffed whale?", to their sensory observations and actions. This poses
a notably difficult challenge for robots: while robot learning approaches allow
robots to learn many different behaviors from first-hand experience, it is
impractical for robots to have first-hand experiences that span all of this
semantic information. We would like a robot's policy to be able to perceive and
pick up the pink stuffed whale, even if it has never seen any data interacting
with a stuffed whale before. Fortunately, static data on the internet has vast
semantic information, and this information is captured in pre-trained
vision-language models. In this paper, we study whether we can interface robot
policies with these pre-trained models, with the aim of allowing robots to
complete instructions involving object categories that the robot has never seen
first-hand. We develop a simple approach, Manipulation of Open-World Objects
(MOO), which leverages a pre-trained vision-language model
to extract object-identifying information from the language command and image,
and conditions the robot policy on the current image, the instruction, and the
extracted object information. In a variety of experiments on a real mobile
manipulator, we find that MOO generalizes zero-shot to a wide range of novel
object categories and environments. In addition, we show how MOO generalizes to
other, non-language-based input modalities for specifying the object of
interest, such as finger pointing, and how it can be further extended to enable
open-world navigation and manipulation. The project's website and evaluation
videos can be found at https://robot-moo.github.io/
Comment: Accepted at the 7th Conference on Robot Learning (CoRL 2023)
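The interface described in this abstract can be sketched roughly as follows. This is a hypothetical Python sketch; parse_object_phrase, detector.locate, and the policy signature are illustrative assumptions rather than the MOO codebase: a frozen, pre-trained vision-language detector grounds the object phrase from the command into image coordinates, and the policy is conditioned on the image, the instruction, and that extracted object information.

    from dataclasses import dataclass

    @dataclass
    class ObjectQuery:
        phrase: str        # e.g. "pink stuffed whale", taken from the command
        center_xy: tuple   # pixel location proposed by the pre-trained detector

    def parse_object_phrase(instruction):
        # Naive placeholder for pulling the object phrase out of the command;
        # a real system would rely on the language model itself.
        return instruction.rsplit("the ", 1)[-1].rstrip("?.!")

    def extract_object_info(image, instruction, detector):
        # Ground the phrase with a frozen, pre-trained open-vocabulary detector
        # (detector.locate is a hypothetical API).
        phrase = parse_object_phrase(instruction)
        center_xy = detector.locate(image, phrase)
        return ObjectQuery(phrase, center_xy)

    def act(image, instruction, policy, detector):
        query = extract_object_info(image, instruction, detector)
        # The policy need not have seen this object category in robot data; it
        # only consumes the generic localization signal plus the instruction.
        return policy(image=image, instruction=instruction,
                      object_center=query.center_xy)

Because the object of interest is reduced to a generic localization signal, the same interface can also accommodate non-language ways of specifying it, such as a pointed finger converted to image coordinates.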